Rule-based Automatic Post-processing of SMT Output to Reduce Human Post-editing Effort

Authors

Victoria Porro

Johanna Gerlach

Pierrette Bouillon

Violeta Seretan

Abstract

To enhance sharing of knowledge across the language barrier, the ACCEPT project focuses on improving machine translation of user-generated content by investigating pre- and post-editing strategies. Within this context, we have developed automatic monolingual post-editing rules for French, aimed at correcting frequent errors automatically. The rules were developed using the Acrolinx IQ technology, which relies on shallow linguistic analysis. In this paper, we present an evaluation of these rules, considering their impact on the readability of MT output and their usefulness for subsequent manual post-editing. Results show that the readability of a high proportion of the data is indeed improved when automatic post-editing rules are applied. Their usefulness is confirmed by the fact that a large share of the edits made by the rules are kept by human post-editors. Moreover, results reveal that edits which improve readability are not necessarily the same as those preserved by post-editors in the final output, hence the importance of considering both readability and post-editing effort in the evaluation of post-editing strategies.

Source: Translating and The Computer 36

1. Introduction

Since the emergence of the Web 2.0 paradigm, user-generated content (UGC) has come to represent a large share of the informative content available. Online communities share technical information and exchange solutions to technical issues through forums and blogs. However, the uneven quality of UGC can hinder both readability and machine-translatability, thus preventing sharing of knowledge between language communities (Jiang et al., 2012; Roturier and Bensadoun, 2011). The ACCEPT project (http://www.accept-project.eu/) aims to improve the Statistical Machine Translation (SMT) of community content through minimally intrusive pre-editing techniques, SMT improvement methods and post-editing strategies. The project targets two specific data domains: the technical forum domain, represented by posts in the Norton Community forum, and the medical domain, illustrated by Translators without Borders documents written by health professionals.

During the first year of the project, we found that pre-editing forum data significantly improves MT output quality (Lehmann et al., 2012; Gerlach et al., 2013a). Further work (Gerlach et al., 2013b) has shown that pre-editing which improves SMT output quality also has a positive impact on bilingual post-editing time. We are now developing post-editing rules intended to reduce post-editing effort by automatically correcting the most frequent errors before submitting MT output to the post-editor. This study focuses on the evaluation of the post-editing rules developed for French, and more specifically, on automatic rules designed for monolingual application.

In the related literature, there are several studies describing post-editing rules and evaluating them using automatic metrics or fluency-adequacy measures (Guzman, 2008; Valotkaite et al., 2012). However, to our knowledge, few such studies look into the actual use of the modifications produced by rules. We will assess: (1) the impact of the rules on the readability of the MT output and (2) their usefulness during the subsequent manual post-editing phase. Our study relies on the following hypotheses: (1) the changes produced by our automatic monolingual rules contribute to making the text more readable; (2) automatic post-editing produces useful changes for the post-editing task and reduces technical effort; and (3) readability and usefulness for post-editing do not necessarily go hand in hand.

The paper is organised as follows. In Section 2, we show how post-editing research is performed in ACCEPT and describe the rules developed for French. In Section 3, we describe the experimental setup and provide details about data, tasks and participants. The results are analysed in Section 4, and conclusions and future work are presented in Section 5.

2. Post-editing in ACCEPT

In the ACCEPT project, post-editing rules, as well as pre-editing rules, are developed using the technology of one of our project partners, namely the Acrolinx IQ engine (Bredenkamp et al., 2000). This rule-based engine uses a combination of shallow NLP components enabling the development of declarative rules, written in a formalism similar to regular expressions, based on the syntactic tagging of the text. A sample rule is displayed in Figure 1.
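To make the idea of a declarative rule over tagged text more concrete, here is a minimal Python sketch. It is not the Acrolinx IQ formalism and does not reproduce Figure 1; the `Rule` class, the `apply_rule` function, the tag names (`PREP`, `VERB`, `DET`) and the elision rule itself are all hypothetical, chosen only to illustrate how a pattern over (word, part-of-speech) pairs can trigger a monolingual correction in French output.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A token is a (surface form, POS tag) pair; the tag set is an assumption.
Token = Tuple[str, str]


@dataclass
class Rule:
    """Hypothetical declarative rule: one predicate per token, plus a rewrite."""
    name: str
    pattern: List[Callable[[Token], bool]]
    rewrite: Callable[[List[Token]], List[Token]]


VOWELS = "aeiouàâéèêëîïôû"


def starts_with_vowel(word: str) -> bool:
    # Guard against empty strings before looking at the first character.
    return word != "" and word[0].lower() in VOWELS


def elide(window: List[Token]) -> List[Token]:
    # Merge "de" + vowel-initial word into a single elided token ("d'...").
    (de, _), (word, tag) = window
    prefix = "D'" if de[0].isupper() else "d'"
    return [(prefix + word, tag)]


# Illustrative rule only: French elision of "de" before a vowel.
ELIDE_DE = Rule(
    name="elide_de_before_vowel",
    pattern=[
        lambda t: t[0].lower() == "de" and t[1] == "PREP",
        lambda t: starts_with_vowel(t[0]),
    ],
    rewrite=elide,
)


def apply_rule(rule: Rule, tokens: List[Token]) -> List[Token]:
    """Scan left to right and rewrite every window that matches the pattern."""
    out: List[Token] = []
    n = len(rule.pattern)
    i = 0
    while i < len(tokens):
        window = tokens[i:i + n]
        if len(window) == n and all(p(t) for p, t in zip(rule.pattern, window)):
            out.extend(rule.rewrite(window))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    mt_output = [("Essayez", "VERB"), ("de", "PREP"), ("installer", "VERB"),
                 ("le", "DET"), ("logiciel", "NOUN")]
    corrected = apply_rule(ELIDE_DE, mt_output)
    print(" ".join(word for word, _ in corrected))
```

Keeping the pattern and the rewrite as data, separate from the matching loop, mirrors the declarative style described above: adding a new correction means adding a new `Rule` instance rather than changing the matcher. A real rule would also need exceptions (for example, words beginning with an aspirated h), which this sketch ignores.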

Similar resources

An Evaluation of Statistical Post-Editing Systems Applied to RBMT and SMT Systems

Statistical post-editing (SPE) of the output produced by rule-based MT (RBMT) systems has been reported to produce extraordinary BLEU (and other automatic evaluation) score improvements. SPE has also been applied to the output of statistical MT (SMT) systems, albeit with more mixed results. We present a statistical post-editing pipeline and evaluate the outputs using automatic and human evaluat...

PEPr: Post-Edit Propagation Using Phrase-based Statistical Machine Translation

Translators who work by post-editing machine translation output often find themselves repeatedly correcting the same errors. We propose a method for Post-edit Propagation (PEPr), which learns post-editor corrections and applies them on-the-fly to further MT output. Our proposal is based on a phrase-based SMT system, used in an automatic post-editing (APE) setting with online learning. Simulated e...

A Three-Layer Architecture for Automatic Post-Editing System Using Rule-Based Paradigm

This paper proposes a post-editing model in which our three-level rule-based automatic post-editing engine called Grafix is presented to refine the output of machine translation systems. The type of corrections on sentences varies from lexical transformation to complex syntactical rearrangement. The experimental results both in manual and automatic evaluations show that the proposed system is a...

Depfix, a Tool for Automatic Rule-based Post-editing of SMT

We present Depfix, an open-source system for automatic post-editing of phrase-based machine translation outputs. Depfix employs a range of natural language processing tools to obtain analyses of the input sentences, and uses a set of rules to correct common or serious errors in machine translation outputs. Depfix is currently implemented only for English-to-Czech translation direction, but exte...

Automatic Post-Editing based on SMT and its selective application by Sentence-Level Automatic Quality Evaluation

In the computing assisted translation process with machine translation (MT), postediting costs time and efforts on the part of human. To solve this problem, some have attempted to automate post editing. Post-editing isn’t always necessary, however, when MT outputs are of adequate quality for human. This means that we need to be able to estimate the translation quality of each translated sentenc...

Publication date: 2015
